- Regression trees
- Classification trees
- Bagging and Random Forests
- Variable Importance measures
- Boosting trees
11/05/2019
\[\widehat{Y} = \widehat{f}(X) = \text{Tree}(X)\]
At each internal node there is a decision rule of the form \(\{x < c\}\): if \(x < c\) go left, otherwise go right
Within each region (interval) we compute the average of the \(y\) values for the subset of training data in the region
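To make this concrete, here is a minimal sketch (using scikit-learn and synthetic data, not the class's own code) showing that each leaf's prediction is just the average of the training \(y\) values that land in that leaf:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))           # a single predictor
y = np.sin(x[:, 0]) + rng.normal(0, 0.3, 200)   # noisy response

# A small regression tree: each leaf corresponds to an interval of x,
# and its prediction is the average of the training y values in that interval.
tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(x, y)

# Check: the fitted value in a leaf equals the mean of y over the
# training points that fall into that leaf.
leaves = tree.apply(x)
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    print(leaf, y[in_leaf].mean(), tree.predict(x[in_leaf][:1])[0])
```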
Now the decision rules can use either of the two \(x\)’s.
lstat (\(\%\) lower status of the population) is the most important factor in determining medv (median value of houses), and houses with lower lstat have higher values compared to houses with larger lstat
Among neighbourhoods with larger lstat, the distance to employment centres seems to play little role in the house value
However, among neighbourhoods with smaller lstat, the distance to employment centres does affect the house value, and houses that are closer to employment centres have higher values
Let’s see a deeper version of the previous tree.
Notice the interaction: the effect of dis depends on lstat!
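A sketch of how a tree like this could be fit (scikit-learn, with synthetic stand-ins for the Boston variables lstat, dis, and medv; the data-generating step below is invented purely for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the Boston data, with an interaction built in:
# dis matters only when lstat is small.
rng = np.random.default_rng(1)
n = 500
lstat = rng.uniform(2, 35, n)
dis = rng.uniform(1, 10, n)
medv = 40 - 0.8 * lstat - np.where(lstat < 10, 2.0 * dis, 0.0) + rng.normal(0, 2, n)
X = pd.DataFrame({"lstat": lstat, "dis": dis})

# A deeper tree can pick up the interaction: splits on dis tend to
# appear only inside the low-lstat branch.
deep = DecisionTreeRegressor(max_depth=3).fit(X, medv)
print(export_text(deep, feature_names=["lstat", "dis"]))
```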
As we saw, the key idea is that a more complex tree is a bigger tree. We usually measure the complexity of a tree by its number of leaves (terminal nodes).
However, in practice two other measures of node purity are preferable to the classification error rate: the Gini index and the entropy.
The Gini index is defined by \[G_m = \sum_{k=1}^K \hat{p}_{mk} (1 - \hat{p}_{mk}),\] where \(\hat{p}_{mk}\) represents the proportion of observations in the \(m^{th}\) region that are from the \(k^{th}\) class. It is a measure of total variance across the \(K\) classes.
An alternative to the Gini index is entropy, given by \[D_m = - \sum_{k=1}^K \hat{p}_{mk} \log(\hat{p}_{mk})\]
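As an illustrative sketch (NumPy/scikit-learn, not part of the slides), both impurities can be computed directly from the class proportions in a node, and a classification tree can be grown with either criterion:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(p):
    """Gini index G_m = sum_k p_mk * (1 - p_mk)."""
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    """Entropy D_m = -sum_k p_mk * log(p_mk), skipping empty classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A pure node has impurity 0; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 0.693...

# scikit-learn lets you grow the tree with either criterion.
clf_gini = DecisionTreeClassifier(criterion="gini")
clf_entropy = DecisionTreeClassifier(criterion="entropy")
```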
Hockey penalty data: the response is whether or not the next penalty is on the other team and \(x\) contains a bunch of variables about the current game (score, …).
The form of the decision rule cannot be \(\{x < c\}\) for categorical variables: the tree picks a subset of the levels to go left. inrow2 = 0 means that all the observations with inrow2 in the category labeled 0 go left.
If:
then there is a \(72\%\) chance the next call will be on the other team.
In another game situation, however, the chance the next call is on the other team is only \(41\%\).
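Those percentages are simply the class proportions \(\hat{p}_{mk}\) in the corresponding leaves. A minimal sketch of this (scikit-learn with made-up data standing in for the penalty data; note that scikit-learn needs categorical predictors encoded numerically, whereas software that handles factors directly can pick subsets of levels as described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the penalty data: one binary (0/1-encoded) feature
# and a binary response "next penalty on the other team".
rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=(1000, 1))
p_true = np.where(x[:, 0] == 0, 0.41, 0.72)   # invented probabilities, loosely echoing the slides
y = rng.binomial(1, p_true)

clf = DecisionTreeClassifier(max_depth=1).fit(x, y)

# The predicted probability in each leaf is just the training proportion
# of the two classes among observations that fall into that leaf.
print(clf.predict_proba([[0], [1]]))
```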
Good:
Bad:
If we bag or boost trees, we can get the best off-the-shelf prediction available.
The idea is to combine a large number (hundreds, thousands) of trees to get an overall predictor.
Ensemble methods: we can improve predictive accuracy at the expense of some loss of interpretation.
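A minimal sketch of what these ensembles look like in code (scikit-learn with synthetic data, shown only as a preview):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, 300)

# Bagging: many trees on bootstrap samples, all predictors tried at each split.
bag = RandomForestRegressor(n_estimators=500, max_features=None).fit(X, y)

# Random forest: like bagging, but only a random subset of predictors per split.
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt").fit(X, y)

# Boosting: trees added sequentially, each one fit to the current residuals.
boost = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05).fit(X, y)

# Variable importance measures come with the ensemble.
print(rf.feature_importances_)
```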
Details in the next class…
Which tree has larger variance?
What would you predict for the value of a house with lstat = 6 and dis = 3?
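One way to answer (a sketch that reuses the hypothetical deep tree fitted on lstat and dis in the earlier example): drop the new point down the tree and read off the leaf average.

```python
import pandas as pd

# Assumes `deep` is the tree fitted on lstat and dis in the earlier sketch.
new_house = pd.DataFrame({"lstat": [6.0], "dis": [3.0]})
print(deep.predict(new_house))  # prediction = mean medv in the leaf this point falls into
```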